wine.data = read.csv('wine.csv')
wine.data$Type = as.factor(wine.data$Type)
scaled.wine.data = data.frame(scale(wine.data[,-c(1)]))
pca = princomp(scaled.wine.data)
plot(pca)
plot(cumsum(pca$sdev^2/sum(pca$sdev^2)))
pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Alcohol 0.144 0.484 0.207 0.266 0.214
## Malic.acid -0.245 0.225 -0.537 0.537 -0.421
## Ash 0.316 -0.626 0.214 0.143 0.154 0.149
## Alcalinity.of.ash -0.239 -0.612 -0.101 0.287
## Magnesium 0.142 0.300 -0.131 0.352 -0.727 -0.323
## Total.phenols 0.395 -0.146 -0.198 0.149
## Flavanoids 0.423 -0.151 -0.152 0.109
## Nonflavanoid.phenols -0.299 -0.170 0.203 0.501 -0.259 -0.595
## Proanthocyanins 0.313 -0.149 -0.399 -0.137 -0.534 -0.372
## Color.intensity 0.530 0.137 -0.419 0.228
## Hue 0.297 -0.279 0.428 0.174 0.106 -0.232
## OD280.OD315.of.diluted.wines 0.376 -0.164 -0.166 -0.184 0.101 0.266
## Proline 0.287 0.365 0.127 0.232 0.158 0.120
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## Alcohol 0.396 0.509 0.212 0.226 0.266
## Malic.acid -0.309 -0.122
## Ash -0.170 -0.308 0.499 -0.141
## Alcalinity.of.ash 0.428 0.200 -0.479
## Magnesium -0.156 0.271
## Total.phenols -0.406 0.286 -0.320 -0.304 0.304 -0.464
## Flavanoids -0.187 -0.163 0.832
## Nonflavanoid.phenols -0.233 0.196 0.216 -0.117 0.114
## Proanthocyanins 0.368 -0.209 0.134 0.237 -0.117
## Color.intensity -0.291 -0.604
## Hue 0.437 -0.522 -0.259
## OD280.OD315.of.diluted.wines 0.137 0.524 -0.601 -0.157
## Proline 0.120 -0.576 0.162 -0.539
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.077 0.154 0.231 0.308 0.385 0.462 0.538 0.615 0.692
## Comp.10 Comp.11 Comp.12 Comp.13
## SS loadings 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077
## Cumulative Var 0.769 0.846 0.923 1.000
pc = prcomp(scaled.wine.data)
plot(pc)
plot(cumsum(pc$sdev^2/sum(pc$sdev^2)))
# First 7 principal components
comp <- data.frame(pc$x[,1:7])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))
Apply PCA to the “wine” dataset using all numerical variables and reduce the dimension.
How many principal components are needed to explain more than 85% of the variance?
I used 7 principal components, which account for roughly 90% of the variance.
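The component count can also be read off programmatically instead of from the cumulative-variance plot. A minimal sketch on simulated data (an assumption for self-containment; in this document it would use `pc$sdev` from the wine PCA):

```r
# Sketch: smallest number of components whose cumulative share of variance
# reaches a target. Simulated data stands in for the wine matrix.
set.seed(1)
x <- scale(matrix(rnorm(100 * 13), ncol = 13))
pc_demo <- prcomp(x)
cum_var <- cumsum(pc_demo$sdev^2) / sum(pc_demo$sdev^2)
n_components <- which(cum_var >= 0.85)[1]
n_components
```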
autoplot(pc, data = wine.data, colour = 'Type') # autoplot method for prcomp objects comes from ggfortify
There appear to be three clusters, and they correspond to the three Types. A few points are mixed together, but overall the separation is quite good in only two dimensions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Registered S3 method overwritten by 'parsnip':
## method from
## autoplot.glmnet ggfortify
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ dials::prune() masks rpart::prune()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
library(nnet)
# Split the data into training and test set
set.seed(123)
pca.data = cbind(wine.data$Type, comp)
names(pca.data)[1] = 'Type'
training.samples <- wine.data$Type %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pca.data[training.samples, ]
test.data <- pca.data[-training.samples, ]
mlr = multinom_reg() |>
fit(Type~., data = train.data)
# Summarize the model
tidy(mlr, exponentiate = TRUE, conf.int = TRUE) |>
mutate_if(is.numeric, round, 4) |>
select(-std.error, -statistic)
## # A tibble: 16 × 6
## y.level term estimate p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 (Intercept) 4.48e+ 0 0.999 0 Inf
## 2 2 PC1 1.00e+ 4 0.980 0 Inf
## 3 2 PC2 1.81e+12 0.992 0 Inf
## 4 2 PC3 2.07e+ 2 0.992 0 Inf
## 5 2 PC4 3.00e- 1 0.994 0 8.19e128
## 6 2 PC5 0 0.901 0 1.30e 72
## 7 2 PC6 3.82e+ 1 0.995 0 Inf
## 8 2 PC7 3.11e+ 1 0.999 0 Inf
## 9 3 (Intercept) 3 e- 4 0.994 0 Inf
## 10 3 PC1 4.13e+ 7 0.966 0 Inf
## 11 3 PC2 0 0.997 0 Inf
## 12 3 PC3 2.28e- 2 0.994 0 Inf
## 13 3 PC4 1.83e- 1 0.993 0 4.74e162
## 14 3 PC5 3 e- 4 0.989 0 Inf
## 15 3 PC6 9.88e+ 3 0.988 0 Inf
## 16 3 PC7 0 0.997 0 Inf
Type_preds <- mlr |>
augment(new_data = test.data)
conf_mat(Type_preds, truth = Type, estimate = .pred_class)
## Truth
## Prediction 1 2 3
## 1 11 0 0
## 2 0 14 0
## 3 0 0 9
roc_auc(Type_preds, truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 1
roc_curve(Type_preds, truth = Type, .pred_1, .pred_2, .pred_3) |>
ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
geom_line(linewidth = 1, alpha = 0.7) +
geom_abline(slope = 1, linetype = "dotted") +
coord_fixed() +
labs(color = NULL) +
theme_light()
possibilities <- expand_grid(
PC1 = seq(-4.31, 4.27, length.out = 10),
PC2 = seq(-3.51, 3.87, length.out = 10),
PC3 = seq(-4.57, 5.33, length.out = 10),
PC4 = seq(-2.88, 3.78, length.out = 10),
PC5 = seq(-4.18, 2.02, length.out = 10),
PC6 = seq(-1.91, 3.28, length.out = 10),
PC7 = seq(-1.63, 2.64, length.out = 10)
)
possibilities <- bind_cols(possibilities,
predict(mlr, new_data = possibilities))
possibilities |>
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(color = .pred_class), alpha = 0.1) +
geom_point(data = test.data, aes(color = Type, shape = Type),
size = 2,
alpha = 0.8) +
labs(color = "Type", shape = "Type") +
theme_light()
svm_rbf_spec <- svm_rbf() %>%
set_mode("classification") %>%
set_engine("kernlab", scaled = FALSE)
svm_rbf_fit <- svm_rbf_spec %>%
fit(Type ~ ., data = train.data)
svm_rbf_fit
## parsnip model object
##
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0829987516109757
##
## Number of Support Vectors : 54
##
## Objective Function Value : -11.6864 -4.0413 -11.2833
## Training error : 0.006944
## Probability model included.
augment(svm_rbf_fit, new_data = test.data) %>%
conf_mat(truth = Type, estimate = .pred_class)
## Truth
## Prediction 1 2 3
## 1 11 0 0
## 2 0 14 1
## 3 0 0 8
augment(svm_rbf_fit, new_data = test.data) %>%
roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
autoplot()
augment(svm_rbf_fit, new_data = test.data) %>%
roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 1
#make a knn spec
knn_spec <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("classification")
#use the knn spec to fit the pre-processed training data
knn_fit <- knn_spec %>%
fit(Type ~., data = train.data)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
augment(knn_fit, new_data = test.data) %>%
conf_mat(truth = Type, estimate = .pred_class)
## Truth
## Prediction 1 2 3
## 1 11 0 0
## 2 0 13 0
## 3 0 1 9
augment(knn_fit, new_data = test.data) %>%
roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
autoplot()
augment(knn_fit, new_data = test.data) %>%
roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 1
tree_spec <- decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("classification")
class_tree_fit <- class_tree_spec %>%
fit(Type ~ ., data = train.data)
class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
augment(class_tree_fit, new_data = test.data) %>%
conf_mat(truth = Type, estimate = .pred_class)
## Truth
## Prediction 1 2 3
## 1 10 2 0
## 2 1 12 0
## 3 0 0 9
augment(class_tree_fit, new_data = test.data) %>%
roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
autoplot()
augment(class_tree_fit, new_data = test.data) %>%
roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 0.953
#make a rf spec
rf_spec <- rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
#use the rf spec to fit the pre-processed training data
rf_fit <- rf_spec %>%
fit(Type ~., data = train.data)
augment(rf_fit, new_data = test.data) %>%
conf_mat(truth = Type, estimate = .pred_class)
## Truth
## Prediction 1 2 3
## 1 10 0 0
## 2 1 14 0
## 3 0 0 9
augment(rf_fit, new_data = test.data) %>%
roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
autoplot()
augment(rf_fit, new_data = test.data) %>%
roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 0.997
## Libraries for Plotting our Results
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(GGally) # some nice plot extensions to ggplot
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(cowplot) # allows us to create matrix plots
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
##
## stamp
library(varhandle) # provides unfactor() function
library(cluster) # provides bottom-up clustering support
library(factoextra) # support for plotting of clusters
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dbscan) # support for the dbscan algorithm
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
set.seed(125) # so we get consistent results for rendering this tutorial
diabetes_data <- read.csv("diabetes.csv", sep=",")
colnames(diabetes_data)<-c("Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","PedigreeFunction","Age","Outcome")
diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)
summary(diabetes_data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI PedigreeFunction Age Outcome
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00 0:500
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00 1:268
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
diabetes_matrix = data.frame(scale(diabetes_data[, -c(9)]))
pc = prcomp(diabetes_matrix)
# Plot of Variance Explained
plot(cumsum(pc$sdev^2/sum(pc$sdev^2)), main = 'Cumulative Variance Explained by Components',
ylab="Explained", xlab="Component")
pc_plot1 <- autoplot(pc, data = diabetes_data, main = "Plot of Diabetes Data in Principal Components Space")
pc_plot2 <- autoplot(pc, data = diabetes_data, colour = "Outcome", main = "Plot of Diabetes Data in Principal Components Space, Colored by Outcome")
plot_grid( # uses cowplot library to arrange grid
pc_plot1, pc_plot2,
nrow = 1
)
### K means
# Set a maximum number of groups to explore
k.max <- 10
# Fit a k-means clustering for each number of groups and calculate the total within-cluster sum of squares
wss <- sapply(1:k.max, function(k){kmeans(diabetes_matrix, k)$tot.withinss})
wss
## [1] 6136.000 5144.194 4354.122 4090.571 3621.022 3357.598 3134.841 3000.041
## [9] 3005.599 2759.770
# Plot the results so we can find the "elbow"
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
# Apply silhouette method to determine clusters
fviz_nbclust(diabetes_matrix, kmeans, method = "silhouette")
# Calculate the K-Means Clusters
km.out <- kmeans(diabetes_matrix, 2, nstart=20)
km.out
## K-means clustering with 2 clusters of sizes 272, 496
##
## Cluster means:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.9581251 0.4333542 0.4065341 -0.14287076 -0.02518500 0.11840510
## 2 -0.5254234 -0.2376459 -0.2229381 0.07834848 0.01381113 -0.06493183
## PedigreeFunction Age
## 1 0.03157570 1.0489862
## 2 -0.01731571 -0.5752505
##
## Clustering vector:
## [1] 1 2 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 2 2 1 2 1
## [38] 1 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2
## [75] 2 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 1 2 1 2 2 1 2
## [149] 1 2 2 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 1 2 1 1 2 2 2 2 1
## [186] 1 1 2 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 2 1 1 2 1 2 2 1 2 1 1 2 2 2 1 2 1
## [223] 2 1 2 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 1 2 1 2 2 2 1 2 2 2 2
## [260] 1 1 2 2 1 2 1 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2
## [297] 2 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 1 1 2 2
## [334] 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 2 2 2 2 1
## [371] 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1
## [408] 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1
## [445] 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 1 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 1 2
## [482] 2 2 2 2 2 2 1 2 1 2 2 2 1 2 1 2 2 1 1 2 2 2 1 2 1 2 2 2 1 1 2 1 2 2 2 1 1
## [519] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 1 1 2 1 1 2 2 1 2 2
## [556] 1 2 1 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2
## [593] 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 1
## [630] 2 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 2
## [667] 1 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 2 2 2 1 1
## [704] 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2
## [741] 1 2 2 1 1 1 2 2 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 2014.888 3107.164
## (between_SS / total_SS = 16.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Plot of assigned clusters
fviz_cluster(list(data = diabetes_matrix, cluster = km.out$cluster))
# Comparing within sum of squares to between sum of squares
# Between Cluster Sum of Squares
km.out$betweenss
## [1] 1013.948
# Within Cluster Sum of Squares
km.out$withinss
## [1] 2014.888 3107.164
# The Ratio
km.out$withinss/km.out$betweenss
## [1] 1.987171 3.064422
# Compare our results from km.out$cluster with the known values
results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), km.out$cluster))
table(results.compare)
## X2
## X1 1 2
## 0 127 373
## 1 145 123
kmeans_accuracy <- (373+145)/768 # check these cells against the table above, since cluster labels can change between runs
kmeans_accuracy
## [1] 0.6744792
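Because k-means labels are arbitrary, hard-coding the matched cells is fragile. A sketch of a two-cluster accuracy that checks both label matchings (`cluster_accuracy` is a hypothetical helper, not part of the analysis above):

```r
# Sketch: accuracy for a 2-class / 2-cluster comparison, taking the best
# of the two possible label matchings so a re-run with flipped cluster
# labels gives the same number.
cluster_accuracy <- function(truth, cluster) {
  tab <- table(truth, cluster)
  max(sum(diag(tab)), tab[1, 2] + tab[2, 1]) / sum(tab)
}
cluster_accuracy(c(0, 0, 1, 1), c(2, 2, 1, 1))  # flipped labels still give 1
```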
# Dissimilarity matrix
d <- dist(diabetes_matrix, method = "euclidean") # can try different distance metrics
# Hierarchical clustering using Complete Linkage
hc_agg <- agnes(d, method = "complete" ) # can try different methods here
hc_agg
## Call: agnes(x = d, method = "complete")
## Agglomerative coefficient: 0.9041579
## Order of objects:
## [1] 1 755 702 161 757 213 370 549 441 507 550 703 57 546 663 133 426 196
## [19] 245 425 470 156 210 762 15 286 224 307 499 54 207 237 604 671 389 41
## [37] 400 228 108 569 723 131 261 516 697 313 647 428 749 612 70 190 111 123
## [55] 522 412 727 701 542 686 529 298 737 136 570 443 548 326 634 385 420 528
## [73] 705 731 766 167 414 728 398 472 482 603 23 340 155 27 179 208 599 85
## [91] 304 231 236 392 379 73 659 692 24 479 668 195 311 34 181 617 732 334
## [109] 635 763 65 220 584 144 247 251 77 177 506 560 275 636 519 31 720 363
## [127] 353 492 93 388 346 724 35 38 359 511 746 134 463 738 147 404 673 87
## [145] 559 713 299 615 741 271 591 315 543 22 345 675 279 300 510 558 513 255
## [163] 520 29 583 43 461 718 460 764 25 649 670 324 89 437 456 115 176 339
## [181] 500 186 239 26 282 215 283 96 556 162 568 205 669 192 694 664 541 595
## [199] 72 218 172 296 217 290 653 403 387 712 418 411 614 639 83 504 166 189
## [217] 478 44 55 613 216 376 459 517 153 260 745 160 3 45 193 405 318 676
## [235] 12 328 37 579 62 587 284 643 474 356 709 661 677 402 750 320 362 496
## [253] 490 760 68 116 222 130 94 141 285 735 537 725 758 767 117 629 593 518
## [271] 561 338 407 637 124 553 149 364 264 667 476 480 13 331 619 246 409 42
## [289] 180 440 524 744 465 132 444 631 691 168 395 751 10 685 8 223 469 16
## [307] 534 644 270 536 620 348 431 602 590 698 173 598 79 436 485 704 333 605
## [325] 262 267 301 337 194 358 2 175 211 483 235 521 66 464 721 104 159 253
## [343] 369 120 150 630 80 753 90 258 551 84 650 323 250 450 377 501 601 48
## [361] 233 433 498 638 75 768 545 442 554 242 458 552 263 88 567 473 765 97
## [379] 241 113 226 322 325 209 655 332 574 110 291 484 544 92 699 170 641 342
## [397] 432 129 430 391 49 122 330 493 665 78 502 109 736 142 505 726 303 706
## [415] 626 128 374 576 256 606 257 254 557 448 632 274 312 710 378 563 449 468
## [433] 423 373 491 555 20 327 182 292 71 390 86 381 645 310 357 375 36 397
## [451] 577 157 566 158 509 761 119 164 424 143 573 137 138 278 316 652 455 739
## [469] 611 734 742 314 531 347 67 201 203 266 494 435 4 335 651 743 272 657
## [487] 33 515 289 349 81 386 688 600 98 681 7 99 366 53 489 206 277 582
## [505] 445 564 64 106 508 689 95 280 383 321 680 308 512 610 341 466 6 497
## [523] 588 367 18 171 265 401 616 715 169 684 234 642 252 434 352 438 152 394
## [541] 185 273 305 47 452 102 627 243 687 633 654 227 759 678 475 525 625 679
## [559] 28 451 51 135 198 225 447 354 422 240 419 530 69 527 608 526 618 368
## [577] 382 439 204 640 317 52 586 672 56 467 91 408 399 462 514 191 695 197
## [595] 269 730 63 114 184 118 11 30 344 351 571 740 125 628 202 281 139 165
## [613] 565 105 700 355 532 578 597 103 572 729 107 76 183 343 350 503 50 523
## [631] 707 61 82 495 427 295 457 454 538 146 372 148 244 309 396 219 660 384
## [649] 417 535 594 622 5 446 40 188 293 658 540 756 488 547 589 46 662 59
## [667] 101 623 229 371 9 14 154 487 410 287 416 646 754 656 248 32 717 607
## [685] 413 575 360 562 596 221 259 361 716 74 140 145 200 365 481 297 393 714
## [703] 708 711 112 187 585 232 249 696 17 421 288 21 539 477 60 722 453 719
## [721] 306 752 592 666 621 319 127 214 151 486 230 329 624 683 100 429 163 336
## [739] 609 690 294 406 302 415 648 121 212 581 471 682 747 238 733 39 174 276
## [757] 533 199 693 268 58 380 748 19 126 178 674 580
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3453 0.9838 1.4190 1.8163 2.1925 12.2052
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method"
# After trying all linkage options, Ward gives the best fit (highest agglomerative coefficient)
hc_ward <- agnes(d, method = "ward" )
hc_ward
## Call: agnes(x = d, method = "ward")
## Agglomerative coefficient: 0.9712135
## Order of objects:
## [1] 1 755 702 161 757 387 712 418 186 210 762 213 499 370 549 441 507 550
## [19] 703 580 24 479 668 26 282 215 96 556 307 649 670 25 89 437 456 87
## [37] 559 713 72 218 217 172 296 115 176 339 500 156 425 470 192 694 664 541
## [55] 595 160 216 376 459 517 299 615 741 271 591 31 720 363 93 388 346 724
## [73] 162 568 205 669 764 35 38 543 359 511 746 134 463 738 147 404 673 66
## [91] 464 721 353 492 142 505 726 303 706 626 290 653 403 330 493 665 5 446
## [109] 46 662 229 371 13 331 619 153 260 745 239 324 246 409 59 623 101 751
## [127] 42 180 440 524 744 132 444 631 691 465 83 504 166 189 478 411 614 639
## [145] 266 494 315 146 372 148 535 219 660 594 622 384 417 435 309 396 9 14
## [163] 154 487 410 287 416 646 754 656 248 32 717 413 575 607 360 562 596 221
## [181] 259 41 400 111 228 648 108 569 723 428 749 612 131 261 516 697 313 647
## [199] 74 140 145 200 297 365 481 133 426 196 245 393 714 708 711 15 286 224
## [217] 389 237 604 671 255 520 488 547 589 44 54 207 57 546 663 55 613 361
## [235] 716 112 187 585 232 249 696 3 45 193 405 318 676 12 328 23 340 155
## [253] 124 553 149 362 496 402 750 208 760 320 490 284 643 474 356 709 661 677
## [271] 27 179 518 561 65 220 584 117 629 593 338 407 637 537 725 758 767 571
## [289] 740 597 68 116 222 364 264 667 476 480 94 141 285 735 295 457 130 538
## [307] 10 685 22 345 675 279 300 510 558 513 29 583 460 43 461 718 37 579
## [325] 62 587 144 247 251 77 177 506 560 635 763 275 636 519 334 73 659 692
## [343] 2 175 211 483 4 104 159 235 521 80 753 90 258 551 84 650 97 241
## [361] 150 630 272 657 33 515 289 349 335 651 81 386 688 600 611 734 742 56
## [379] 467 98 681 526 618 7 99 366 242 458 552 263 49 122 78 502 109 736
## [397] 119 164 424 143 573 256 606 257 53 489 206 277 582 445 564 195 311 63
## [415] 114 184 118 76 183 343 350 503 28 451 69 527 608 368 382 439 51 377
## [433] 120 253 369 204 640 317 135 225 447 198 354 422 240 419 530 52 586 672
## [451] 514 633 654 91 408 399 462 191 695 197 269 102 627 243 687 20 327 310
## [469] 182 292 71 390 86 381 645 64 106 508 302 415 529 95 689 21 539 477
## [487] 298 737 412 727 701 542 686 36 397 577 244 157 566 158 314 531 347 509
## [505] 761 280 383 680 321 48 233 433 498 638 137 138 278 442 554 67 201 203
## [523] 170 641 342 432 70 190 385 420 528 705 731 766 92 699 283 167 414 450
## [541] 728 323 250 501 601 308 512 326 634 341 466 610 743 123 522 209 655 316
## [559] 652 332 574 136 570 443 548 129 430 391 6 497 588 367 18 171 265 401
## [577] 616 715 34 181 617 732 152 394 185 273 305 103 572 729 107 11 30 344
## [595] 351 169 234 642 684 252 434 352 438 47 452 168 395 139 165 565 125 628
## [613] 227 759 475 525 625 679 678 730 85 304 202 281 599 231 236 392 379 105
## [631] 700 355 532 578 17 421 288 100 690 151 486 429 163 336 609 178 674 121
## [649] 238 733 212 581 471 682 747 40 188 540 756 293 658 58 380 748 19 126
## [667] 174 423 276 533 199 693 268 294 406 357 375 312 710 378 563 449 468 373
## [685] 491 555 455 739 39 254 557 88 567 473 765 128 374 576 322 325 482 603
## [703] 398 472 75 768 545 113 226 110 291 484 544 274 448 632 60 722 453 719
## [721] 624 683 127 214 230 329 306 752 319 592 666 621 8 223 469 16 534 644
## [739] 173 598 348 431 602 590 698 454 79 436 485 704 333 270 536 620 605 262
## [757] 267 301 337 194 358 50 523 707 61 82 495 427
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3453 0.9918 1.4537 2.3417 2.4063 40.4477
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method"
# Plot the obtained dendrogram
pltree(hc_ward, cex = 0.6, hang = -1, main="Ward Linkage")
# Cut the tree into 2 groups
hc_sub_grp <- cutree(hc_ward, k = 2) # gives us a vector of the cluster assigned for each observation
# Plot our clusters
fviz_cluster(list(data = diabetes_matrix, cluster = hc_sub_grp))
# Compare our results from hc_sub_grp with the known values
hcward_results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), hc_sub_grp))
table(hcward_results.compare)
## hc_sub_grp
## V1 1 2
## 0 167 333
## 1 175 93
hc_ward_accuracy <- (333+175)/768
hc_ward_accuracy
## [1] 0.6614583
# Conduct divisive (top-down) clustering
hc_divisive <- diana(diabetes_matrix)
hc_divisive$dc # get the divisive coefficient - equivalent to ac
## [1] 0.8905177
##### Calculate the epsilon neighborhood - i.e. find the optimal eps from the knee of the kNN distance plot
kNNdistplot(diabetes_matrix, k=5)
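As a numeric complement to eyeballing the knee of the kNN distance plot, one heuristic is a high quantile of the 5-NN distances. This is an assumption-laden sketch in base R (`knn_dist` is a hypothetical helper, and `iris` stands in for the diabetes matrix for self-containment); it is not the procedure used above:

```r
# Sketch: distance from each point to its 5th nearest neighbor; a high
# quantile of these distances gives a candidate eps.
knn_dist <- function(x, k = 5) {
  dm <- as.matrix(dist(x))
  apply(dm, 1, function(row) sort(row)[k + 1])  # [1] is the self-distance 0
}
d5 <- knn_dist(scale(iris[, 1:4]))
quantile(d5, 0.95)  # candidate eps
```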
# Implement DBSCAN
db = dbscan(diabetes_matrix, eps = 2, minPts = 5, borderPoints = FALSE)
# Visualize the results
fviz_cluster(list(data = diabetes_matrix, cluster = db$cluster))
# Hullplot from dbscan package isn't confused by unassigned cases
hullplot(diabetes_matrix, db$cluster)
# Compare our results from db$cluster with the known values
dbscan_results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), db$cluster))
table(dbscan_results.compare)
## X2
## X1 0 1 2 3
## 0 35 449 10 6
## 1 48 209 11 0
dbscan_accuracy <- (449+48)/768
dbscan_accuracy
## [1] 0.6471354
Which clustering method performs best at separating the Outcome classes? Justify with plots and accuracy rates.
Discuss each clustering method on this data set. Interpret the results, and the pros and cons of each method on these data.
fviz_cluster(list(data = diabetes_matrix, cluster = km.out$cluster))
K-means accuracy is about 67%
fviz_cluster(list(data = diabetes_matrix, cluster = hc_sub_grp))
Hierarchical accuracy is about 66%
fviz_cluster(list(data = diabetes_matrix, cluster = db$cluster))
DBSCAN accuracy is about 65%
From this, we conclude that K-means clustering is best for this diabetes dataset. Since we know the response is binary, we can leverage that knowledge to force K-means and hierarchical clustering into exactly two clusters. We don't have that option with DBSCAN, which leaves it with somewhat lower overall accuracy.
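The three estimates above can be gathered into one table for a side-by-side comparison (the values are the accuracies computed earlier in this section):

```r
# Accuracy estimates from the three clustering runs above.
comparison <- data.frame(
  method   = c("k-means", "hierarchical (Ward)", "DBSCAN"),
  accuracy = c(0.674, 0.661, 0.647)
)
comparison[which.max(comparison$accuracy), "method"]  # -> "k-means"
```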
diabetes_data$cluster = km.out$cluster
diabetes_data$Outcome = as.factor(diabetes_data$Outcome)
diabetes_data$cluster = as.factor(diabetes_data$cluster)
diabetes_data[,c(1:8)] = scale(diabetes_data[,c(1:8)])
training.samples <- diabetes_data$Outcome %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- diabetes_data[training.samples, ]
test.data <- diabetes_data[-training.samples, ]
Logit <- glm(formula = Outcome ~ .,
family = binomial(link = "logit"),
data = train.data)
summary(Logit)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"),
## data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.710017 0.288393 -2.462 0.01382 *
## Pregnancies 0.283801 0.143911 1.972 0.04860 *
## Glucose 1.087337 0.132598 8.200 2.40e-16 ***
## BloodPressure -0.296904 0.119728 -2.480 0.01314 *
## SkinThickness -0.003014 0.121949 -0.025 0.98028
## Insulin -0.193470 0.116640 -1.659 0.09718 .
## BMI 0.827599 0.140617 5.885 3.97e-09 ***
## PedigreeFunction 0.349440 0.113738 3.072 0.00212 **
## Age 0.148225 0.162775 0.911 0.36250
## cluster2 -0.268043 0.411298 -0.652 0.51459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 796.05 on 614 degrees of freedom
## Residual deviance: 579.60 on 605 degrees of freedom
## AIC: 599.6
##
## Number of Fisher Scoring iterations: 5
Logit.step<-stats::step(Logit, direction = "both")
## Start: AIC=599.6
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + PedigreeFunction + Age + cluster
##
## Df Deviance AIC
## - SkinThickness 1 579.60 597.60
## - cluster 1 580.03 598.03
## - Age 1 580.43 598.43
## <none> 579.60 599.60
## - Insulin 1 582.33 600.33
## - Pregnancies 1 583.53 601.53
## - BloodPressure 1 585.84 603.84
## - PedigreeFunction 1 589.54 607.54
## - BMI 1 620.21 638.21
## - Glucose 1 663.79 681.79
##
## Step: AIC=597.6
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## PedigreeFunction + Age + cluster
##
## Df Deviance AIC
## - cluster 1 580.03 596.03
## - Age 1 580.44 596.44
## <none> 579.60 597.60
## - Insulin 1 582.94 598.94
## - Pregnancies 1 583.54 599.54
## + SkinThickness 1 579.60 599.60
## - BloodPressure 1 586.03 602.03
## - PedigreeFunction 1 589.56 605.56
## - BMI 1 624.23 640.23
## - Glucose 1 665.10 681.10
##
## Step: AIC=596.03
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## PedigreeFunction + Age
##
## Df Deviance AIC
## <none> 580.03 596.03
## - Age 1 583.01 597.01
## - Insulin 1 583.47 597.47
## + cluster 1 579.60 597.60
## + SkinThickness 1 580.03 598.03
## - BloodPressure 1 586.06 600.06
## - Pregnancies 1 588.09 602.09
## - PedigreeFunction 1 590.18 604.18
## - BMI 1 624.88 638.88
## - Glucose 1 673.70 687.70
summary(Logit.step)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + PedigreeFunction + Age, family = binomial(link = "logit"),
## data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8845 0.1088 -8.132 4.21e-16 ***
## Pregnancies 0.3359 0.1196 2.810 0.00496 **
## Glucose 1.1047 0.1291 8.556 < 2e-16 ***
## BloodPressure -0.2824 0.1158 -2.439 0.01472 *
## Insulin -0.1970 0.1058 -1.863 0.06250 .
## BMI 0.8282 0.1347 6.149 7.81e-10 ***
## PedigreeFunction 0.3519 0.1134 3.103 0.00191 **
## Age 0.2164 0.1251 1.729 0.08380 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 796.05 on 614 degrees of freedom
## Residual deviance: 580.03 on 607 degrees of freedom
## AIC: 596.03
##
## Number of Fisher Scoring iterations: 5
# Put the predicted probability and class (at a 0.45 threshold) at the end of the dataframe
predProbLogit <- predict(Logit.step, type = "response", newdata = test.data)
predClassLogit <- factor(predProbLogit > 0.45, levels = c(F,T), labels = c("0","1"))
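The 0.45 cutoff is a judgment call; it could also be tuned by scanning candidate thresholds for the best accuracy. A sketch with hypothetical stand-ins (`prob` for `predProbLogit`, `truth` for `test.data$Outcome`); note that tuning on the test set itself would be optimistic, so a validation split would normally be used:

```r
# Sketch: pick the cutoff that maximizes accuracy over a grid.
# `best_threshold` is a hypothetical helper, not part of the analysis above.
best_threshold <- function(prob, truth, grid = seq(0.05, 0.95, by = 0.05)) {
  acc <- sapply(grid, function(t) mean((prob > t) == (truth == 1)))
  grid[which.max(acc)]
}
best_threshold(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))  # first perfect cutoff
```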
# Generate a confusion matrix and performance statistics on test dataset
confusionMatrix(data=predClassLogit, reference=test.data$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 87 17
## 1 13 36
##
## Accuracy : 0.8039
## 95% CI : (0.7321, 0.8636)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 3.3e-05
##
## Kappa : 0.5592
##
## Mcnemar's Test P-Value : 0.5839
##
## Sensitivity : 0.8700
## Specificity : 0.6792
## Pos Pred Value : 0.8365
## Neg Pred Value : 0.7347
## Prevalence : 0.6536
## Detection Rate : 0.5686
## Detection Prevalence : 0.6797
## Balanced Accuracy : 0.7746
##
## 'Positive' Class : 0
##
# Calculate AUC with the ROCR package - use @y.values to get the AUC out
library(ROCR)
Logit.AUC <- performance(prediction(predProbLogit, test.data$Outcome), measure="auc")@y.values
Logit.AUC
## [[1]]
## [1] 0.8415094
# Set a seed value so we get the same answer each time we fit
set.seed(123)
# Balance data using weights
## SVM performs poorly when the data set is not balanced.
wts <- 100/table(diabetes_data$Outcome)
# Fit an SVM with a linear kernel, Outcome as the target and the remaining variables as predictors
SVM <- svm(Outcome ~ ., data=train.data, kernel="linear",
cost=1, gamma=1, class.weights=wts,
probability=TRUE, decision.values=TRUE)
# Get the probabilities predicted by SVM
predProbSVM.raw <-predict(SVM, test.data, probability = TRUE)
# Extract the class probabilities from the attributes (the column order
# follows the model's class ordering, so check colnames() if in doubt)
predProbSVM <- attributes(predProbSVM.raw)$probabilities[,1]
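One caveat with `[,1]` above: the columns of the `probabilities` attribute are named by class, and their order is determined by the fitted model rather than any fixed convention, so indexing by name is more robust. A sketch on a hypothetical probability matrix:

```r
# Hypothetical matrix shaped like the "probabilities" attribute:
# one named column per class, order not guaranteed
probs <- matrix(c(0.8, 0.3, 0.2, 0.7), ncol = 2,
                dimnames = list(NULL, c("0", "1")))
probs[, "1"]  # select the positive class by name, not by position
```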
# Get the probability classes
predClassSVM <- predict(SVM, newdata = test.data)
confusionMatrix(predClassSVM, test.data$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 13
## 1 19 40
##
## Accuracy : 0.7908
## 95% CI : (0.7178, 0.8523)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 0.0001499
##
## Kappa : 0.5501
##
## Mcnemar's Test P-Value : 0.3767591
##
## Sensitivity : 0.8100
## Specificity : 0.7547
## Pos Pred Value : 0.8617
## Neg Pred Value : 0.6780
## Prevalence : 0.6536
## Detection Rate : 0.5294
## Detection Prevalence : 0.6144
## Balanced Accuracy : 0.7824
##
## 'Positive' Class : 0
##
# Calculate ROC curves and AUC for the logit and SVM models on the test set
Logit.ROC <- performance(prediction(predProbLogit, test.data$Outcome), measure="tpr", x.measure="fpr")
SVM.ROC <- performance(prediction(predProbSVM, test.data$Outcome), measure="tpr", x.measure="fpr")
SVM.AUC <- performance(prediction(predProbSVM, test.data$Outcome), measure="auc")@y.values
library(class) # new library for KNN modeling
library(rpart) # new library for CART modeling
# New package for making pretty pictures of CART trees
library(rpart.plot)
# Assigning the response
train.def <- train.data$Outcome
test.def <- test.data$Outcome
# Assign explanatory variables
train.gc <- train.data[,c(1:8,10)]
test.gc <- test.data[,c(1:8,10)]
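KNN is distance-based, so it only works well here because the predictors were standardized earlier; on raw scales, a wide-ranged feature would dominate the Euclidean distance. A toy illustration with hypothetical numbers:

```r
# Two points measured on very different scales
x <- rbind(c(100, 0.1), c(110, 0.3))
dist(x)         # distance is dominated by the first feature
dist(scale(x))  # after standardizing, both features contribute equally
```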
knn.1 <- knn(train.gc, test.gc, train.def, k=1)
knn.5 <- knn(train.gc, test.gc, train.def, k=5)
knn.20 <- knn(train.gc, test.gc, train.def, k=20)
sum(test.def == knn.1)/length(test.def) # Accuracy for k = 1
## [1] 0.7320261
sum(test.def == knn.5)/length(test.def) # Accuracy for k = 5
## [1] 0.7320261
sum(test.def == knn.20)/length(test.def) # Accuracy for k = 20
## [1] 0.7712418
knn.28 <- knn(train.gc, test.gc, train.def, k=28) # named to avoid masking class::knn
sum(test.def == knn.28)/length(test.def) # Accuracy for k = sqrt(n)
## [1] 0.8039216
# Confusion Matrix
confusionMatrix(knn.28, test.data$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 22
## 1 8 31
##
## Accuracy : 0.8039
## 95% CI : (0.7321, 0.8636)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 3.3e-05
##
## Kappa : 0.5383
##
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.9200
## Specificity : 0.5849
## Pos Pred Value : 0.8070
## Neg Pred Value : 0.7949
## Prevalence : 0.6536
## Detection Rate : 0.6013
## Detection Prevalence : 0.7451
## Balanced Accuracy : 0.7525
##
## 'Positive' Class : 0
##
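The choice k = 28 follows the common k ≈ sqrt(n) rule of thumb. Note that 28 comes from the full data set (n = 768); the 615-row training split would suggest k = 25 instead, though either is a reasonable starting point:

```r
# The k = sqrt(n) rule of thumb at both sample sizes
round(sqrt(768))  # full Pima data set -> 28
round(sqrt(615))  # training split used here -> 25
```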
knn.prob <- knn(train.gc, test.gc, train.def, k=28, prob=TRUE)
# Convert knn's winning-class vote share into P(Outcome = "1")
KNN.prob <- rep(0, length(knn.prob)) ## Initializing
for(i in 1:length(KNN.prob)){
KNN.prob[i] <- ifelse(knn.prob[i] != "0", attr(knn.prob, 'prob')[i], 1 - attr(knn.prob, 'prob')[i])
}
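The loop above is needed because knn's `prob` attribute stores the vote share of the *winning* class, not a fixed class, so it must be flipped whenever the winner is "0". A vectorized equivalent on hypothetical knn output:

```r
# Hypothetical winning classes and their vote shares from knn(..., prob = TRUE)
pred_class <- factor(c("1", "0", "1"))
vote_share <- c(0.60, 0.75, 0.90)
# Flip the share whenever the winning class is "0"
p_one <- ifelse(pred_class != "0", vote_share, 1 - vote_share)
p_one  # 0.60 0.25 0.90
```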
predProbKNN <- prediction(KNN.prob, test.data$Outcome)
KNN.ROC <- performance(predProbKNN, measure="tpr", x.measure="fpr")
KNN.AUC <- performance(prediction(KNN.prob, test.data$Outcome),
measure="auc")@y.values
CART <- rpart(Outcome ~., data=train.data)
summary(CART)
## Call:
## rpart(formula = Outcome ~ ., data = train.data)
## n= 615
##
## CP nsplit rel error xerror xstd
## 1 0.25581395 0 1.0000000 1.0000000 0.05500133
## 2 0.10232558 1 0.7441860 0.7906977 0.05158560
## 3 0.01395349 2 0.6418605 0.6883721 0.04930753
## 4 0.01085271 10 0.5209302 0.7395349 0.05050158
## 5 0.01000000 16 0.4511628 0.7255814 0.05018716
##
## Variable importance
## Glucose BMI Age Insulin
## 36 17 10 10
## BloodPressure cluster PedigreeFunction Pregnancies
## 8 7 6 4
## SkinThickness
## 2
##
## Node number 1: 615 observations, complexity param=0.255814
## predicted class=0 expected loss=0.3495935 P(node) =1
## class counts: 400 215
## probabilities: 0.650 0.350
## left son=2 (388 obs) right son=3 (227 obs)
## Primary splits:
## Glucose < 0.2065977 to the left, improve=53.06460, (0 missing)
## BMI < -0.3985939 to the left, improve=25.82002, (0 missing)
## Age < -0.4031286 to the left, improve=21.71025, (0 missing)
## cluster splits as RL, improve=20.29458, (0 missing)
## Insulin < 0.5484062 to the left, improve=11.70732, (0 missing)
## Surrogate splits:
## Insulin < 0.3575069 to the left, agree=0.699, adj=0.185, (0 split)
## BloodPressure < 0.61452 to the left, agree=0.667, adj=0.097, (0 split)
## Age < 1.297518 to the left, agree=0.663, adj=0.088, (0 split)
## cluster splits as RL, agree=0.663, adj=0.088, (0 split)
## BMI < 0.9839249 to the left, agree=0.657, adj=0.070, (0 split)
##
## Node number 2: 388 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.1907216 P(node) =0.6308943
## class counts: 314 74
## probabilities: 0.809 0.191
## left son=4 (102 obs) right son=5 (286 obs)
## Primary splits:
## BMI < -0.7030017 to the left, improve=10.066900, (0 missing)
## Age < -0.4031286 to the left, improve= 9.205109, (0 missing)
## cluster splits as RL, improve= 7.046622, (0 missing)
## Pregnancies < -0.1024022 to the left, improve= 5.806035, (0 missing)
## Glucose < -0.6065982 to the left, improve= 5.159331, (0 missing)
## Surrogate splits:
## BloodPressure < -2.950302 to the left, agree=0.745, adj=0.029, (0 split)
## PedigreeFunction < 2.792164 to the right, agree=0.745, adj=0.029, (0 split)
##
## Node number 3: 227 observations, complexity param=0.1023256
## predicted class=1 expected loss=0.3788546 P(node) =0.3691057
## class counts: 86 141
## probabilities: 0.379 0.621
## left son=6 (64 obs) right son=7 (163 obs)
## Primary splits:
## BMI < -0.2590736 to the left, improve=15.305370, (0 missing)
## Glucose < 1.426391 to the left, improve=11.063870, (0 missing)
## PedigreeFunction < -0.462913 to the left, improve= 6.218467, (0 missing)
## Age < -0.7432579 to the left, improve= 4.007314, (0 missing)
## BloodPressure < -1.348715 to the right, improve= 3.002903, (0 missing)
## Surrogate splits:
## Age < 1.807712 to the right, agree=0.740, adj=0.078, (0 split)
## PedigreeFunction < -1.045416 to the left, agree=0.727, adj=0.031, (0 split)
##
## Node number 4: 102 observations
## predicted class=0 expected loss=0 P(node) =0.1658537
## class counts: 102 0
## probabilities: 1.000 0.000
##
## Node number 5: 286 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.2587413 P(node) =0.4650407
## class counts: 212 74
## probabilities: 0.741 0.259
## left son=10 (144 obs) right son=11 (142 obs)
## Primary splits:
## Age < -0.4031286 to the left, improve=8.332272, (0 missing)
## cluster splits as RL, improve=6.792232, (0 missing)
## PedigreeFunction < 1.928972 to the left, improve=6.251258, (0 missing)
## Glucose < -0.6691517 to the left, improve=5.980590, (0 missing)
## Pregnancies < 0.1943709 to the left, improve=5.298337, (0 missing)
## Surrogate splits:
## Pregnancies < -0.1024022 to the left, agree=0.787, adj=0.570, (0 split)
## cluster splits as RL, agree=0.780, adj=0.556, (0 split)
## Insulin < -0.6230214 to the right, agree=0.640, adj=0.275, (0 split)
## BloodPressure < 0.09787922 to the left, agree=0.633, adj=0.261, (0 split)
## SkinThickness < -0.9739372 to the right, agree=0.608, adj=0.211, (0 split)
##
## Node number 6: 64 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.328125 P(node) =0.104065
## class counts: 43 21
## probabilities: 0.672 0.328
## left son=12 (47 obs) right son=13 (17 obs)
## Primary splits:
## Glucose < 1.223092 to the left, improve=3.132392, (0 missing)
## Age < -0.4881609 to the left, improve=2.684247, (0 missing)
## Pregnancies < -0.3991752 to the left, improve=1.707615, (0 missing)
## BMI < -1.115221 to the left, improve=1.692434, (0 missing)
## BloodPressure < 0.61452 to the right, improve=1.064808, (0 missing)
## Surrogate splits:
## Age < 2.615519 to the left, agree=0.75, adj=0.059, (0 split)
##
## Node number 7: 163 observations, complexity param=0.01085271
## predicted class=1 expected loss=0.2638037 P(node) =0.2650407
## class counts: 43 120
## probabilities: 0.264 0.736
## left son=14 (110 obs) right son=15 (53 obs)
## Primary splits:
## Glucose < 1.395115 to the left, improve=5.571202, (0 missing)
## PedigreeFunction < -0.4930945 to the left, improve=4.423668, (0 missing)
## BloodPressure < -0.4187616 to the right, improve=3.173023, (0 missing)
## BMI < -0.1449207 to the right, improve=1.966217, (0 missing)
## Age < -0.2330639 to the left, improve=1.636171, (0 missing)
## Surrogate splits:
## PedigreeFunction < 2.778582 to the left, agree=0.693, adj=0.057, (0 split)
## SkinThickness < 2.442516 to the left, agree=0.687, adj=0.038, (0 split)
## Insulin < 3.234012 to the left, agree=0.687, adj=0.038, (0 split)
## BMI < -0.2273645 to the right, agree=0.687, adj=0.038, (0 split)
## Age < 1.467583 to the left, agree=0.687, adj=0.038, (0 split)
##
## Node number 10: 144 observations
## predicted class=0 expected loss=0.1388889 P(node) =0.2341463
## class counts: 124 20
## probabilities: 0.861 0.139
##
## Node number 11: 142 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.3802817 P(node) =0.2308943
## class counts: 88 54
## probabilities: 0.620 0.380
## left son=22 (53 obs) right son=23 (89 obs)
## Primary splits:
## Glucose < -0.6378749 to the left, improve=6.208780, (0 missing)
## PedigreeFunction < 0.4621506 to the left, improve=5.564858, (0 missing)
## Insulin < 0.5440675 to the left, improve=2.770399, (0 missing)
## BloodPressure < 0.9245045 to the right, improve=1.881190, (0 missing)
## BMI < -0.1068697 to the right, improve=1.418053, (0 missing)
## Surrogate splits:
## BloodPressure < -2.795309 to the left, agree=0.634, adj=0.019, (0 split)
## BMI < 1.83373 to the right, agree=0.634, adj=0.019, (0 split)
## PedigreeFunction < -1.128415 to the left, agree=0.634, adj=0.019, (0 split)
##
## Node number 12: 47 observations
## predicted class=0 expected loss=0.2340426 P(node) =0.07642276
## class counts: 36 11
## probabilities: 0.766 0.234
##
## Node number 13: 17 observations
## predicted class=1 expected loss=0.4117647 P(node) =0.02764228
## class counts: 7 10
## probabilities: 0.412 0.588
##
## Node number 14: 110 observations, complexity param=0.01085271
## predicted class=1 expected loss=0.3545455 P(node) =0.1788618
## class counts: 39 71
## probabilities: 0.355 0.645
## left son=28 (95 obs) right son=29 (15 obs)
## Primary splits:
## BloodPressure < -0.4187616 to the right, improve=4.366507, (0 missing)
## Insulin < 1.498564 to the right, improve=3.525327, (0 missing)
## Age < -0.2330639 to the left, improve=2.257921, (0 missing)
## BMI < -0.1449207 to the right, improve=2.168984, (0 missing)
## PedigreeFunction < 0.7790565 to the left, improve=1.897678, (0 missing)
##
## Node number 15: 53 observations
## predicted class=1 expected loss=0.0754717 P(node) =0.08617886
## class counts: 4 49
## probabilities: 0.075 0.925
##
## Node number 22: 53 observations
## predicted class=0 expected loss=0.1886792 P(node) =0.08617886
## class counts: 43 10
## probabilities: 0.811 0.189
##
## Node number 23: 89 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.494382 P(node) =0.1447154
## class counts: 45 44
## probabilities: 0.506 0.494
## left son=46 (17 obs) right son=47 (72 obs)
## Primary splits:
## PedigreeFunction < -0.8069823 to the left, improve=5.964970, (0 missing)
## BloodPressure < 0.9245045 to the right, improve=2.115840, (0 missing)
## Age < 1.680164 to the right, improve=1.877657, (0 missing)
## BMI < 0.2229054 to the right, improve=1.316879, (0 missing)
## SkinThickness < 1.188772 to the right, improve=1.233310, (0 missing)
## Surrogate splits:
## SkinThickness < 1.690269 to the right, agree=0.831, adj=0.118, (0 split)
## Glucose < -0.5753214 to the left, agree=0.820, adj=0.059, (0 split)
## BloodPressure < -1.503707 to the left, agree=0.820, adj=0.059, (0 split)
##
## Node number 28: 95 observations, complexity param=0.01085271
## predicted class=1 expected loss=0.4105263 P(node) =0.1544715
## class counts: 39 56
## probabilities: 0.411 0.589
## left son=56 (41 obs) right son=57 (54 obs)
## Primary splits:
## Age < -0.2330639 to the left, improve=4.409842, (0 missing)
## Insulin < 1.498564 to the right, improve=3.612432, (0 missing)
## Pregnancies < 0.787917 to the left, improve=2.388561, (0 missing)
## PedigreeFunction < 0.7790565 to the left, improve=2.245614, (0 missing)
## cluster splits as RL, improve=2.063618, (0 missing)
## Surrogate splits:
## cluster splits as RL, agree=0.926, adj=0.829, (0 split)
## Pregnancies < -0.1024022 to the left, agree=0.758, adj=0.439, (0 split)
## BloodPressure < 0.4078637 to the left, agree=0.642, adj=0.171, (0 split)
## Insulin < 0.9215275 to the right, agree=0.642, adj=0.171, (0 split)
## BMI < 0.9205066 to the right, agree=0.642, adj=0.171, (0 split)
##
## Node number 29: 15 observations
## predicted class=1 expected loss=0 P(node) =0.02439024
## class counts: 0 15
## probabilities: 0.000 1.000
##
## Node number 46: 17 observations
## predicted class=0 expected loss=0.1176471 P(node) =0.02764228
## class counts: 15 2
## probabilities: 0.882 0.118
##
## Node number 47: 72 observations, complexity param=0.01395349
## predicted class=1 expected loss=0.4166667 P(node) =0.1170732
## class counts: 30 42
## probabilities: 0.417 0.583
## left son=94 (48 obs) right son=95 (24 obs)
## Primary splits:
## PedigreeFunction < 0.2689889 to the left, improve=3.125000, (0 missing)
## Glucose < -0.4814911 to the right, improve=2.329032, (0 missing)
## BloodPressure < 0.9245045 to the right, improve=1.864516, (0 missing)
## Age < 1.467583 to the right, improve=1.864516, (0 missing)
## BMI < 1.370777 to the left, improve=1.162637, (0 missing)
## Surrogate splits:
## Insulin < 0.5874537 to the left, agree=0.708, adj=0.125, (0 split)
## BMI < 1.142471 to the left, agree=0.708, adj=0.125, (0 split)
## BloodPressure < -0.9354024 to the right, agree=0.694, adj=0.083, (0 split)
## Pregnancies < -0.9927214 to the right, agree=0.681, adj=0.042, (0 split)
## Glucose < -0.5753214 to the right, agree=0.681, adj=0.042, (0 split)
##
## Node number 56: 41 observations, complexity param=0.01085271
## predicted class=0 expected loss=0.4146341 P(node) =0.06666667
## class counts: 24 17
## probabilities: 0.585 0.415
## left son=112 (10 obs) right son=113 (31 obs)
## Primary splits:
## Insulin < 1.498564 to the right, improve=4.5476000, (0 missing)
## PedigreeFunction < -0.462913 to the left, improve=2.5892520, (0 missing)
## BloodPressure < 0.2012074 to the right, improve=1.6255160, (0 missing)
## BMI < 1.24394 to the left, improve=0.9656574, (0 missing)
## Glucose < 1.144901 to the left, improve=0.8797118, (0 missing)
## Surrogate splits:
## Glucose < 1.3482 to the right, agree=0.805, adj=0.2, (0 split)
## PedigreeFunction < -0.7194559 to the left, agree=0.780, adj=0.1, (0 split)
## Age < -0.3180962 to the right, agree=0.780, adj=0.1, (0 split)
##
## Node number 57: 54 observations
## predicted class=1 expected loss=0.2777778 P(node) =0.08780488
## class counts: 15 39
## probabilities: 0.278 0.722
##
## Node number 94: 48 observations, complexity param=0.01395349
## predicted class=0 expected loss=0.4791667 P(node) =0.07804878
## class counts: 25 23
## probabilities: 0.521 0.479
## left son=188 (15 obs) right son=189 (33 obs)
## Primary splits:
## Insulin < -0.5969897 to the right, improve=1.970455, (0 missing)
## PedigreeFunction < -0.1535524 to the right, improve=1.968860, (0 missing)
## BloodPressure < -0.1087771 to the right, improve=1.756859, (0 missing)
## Glucose < -0.04361642 to the right, improve=1.678842, (0 missing)
## Age < 1.255002 to the right, improve=1.462607, (0 missing)
## Surrogate splits:
## SkinThickness < -0.91125 to the right, agree=0.792, adj=0.333, (0 split)
## PedigreeFunction < -0.2214608 to the right, agree=0.750, adj=0.200, (0 split)
## Glucose < 0.1127674 to the right, agree=0.729, adj=0.133, (0 split)
## BMI < 0.9648994 to the right, agree=0.729, adj=0.133, (0 split)
##
## Node number 95: 24 observations
## predicted class=1 expected loss=0.2083333 P(node) =0.03902439
## class counts: 5 19
## probabilities: 0.208 0.792
##
## Node number 112: 10 observations
## predicted class=0 expected loss=0 P(node) =0.01626016
## class counts: 10 0
## probabilities: 1.000 0.000
##
## Node number 113: 31 observations, complexity param=0.01085271
## predicted class=1 expected loss=0.4516129 P(node) =0.0504065
## class counts: 14 17
## probabilities: 0.452 0.548
## left son=226 (22 obs) right son=227 (9 obs)
## Primary splits:
## BMI < 1.231256 to the left, improve=1.3346370, (0 missing)
## Glucose < 0.5506421 to the left, improve=1.2798390, (0 missing)
## PedigreeFunction < -0.462913 to the left, improve=1.2476960, (0 missing)
## BloodPressure < 0.4078637 to the right, improve=1.2009930, (0 missing)
## Insulin < 0.63084 to the left, improve=0.8765778, (0 missing)
## Surrogate splits:
## BloodPressure < 0.7953443 to the left, agree=0.806, adj=0.333, (0 split)
## SkinThickness < 1.251459 to the left, agree=0.774, adj=0.222, (0 split)
## Glucose < 0.5819188 to the left, agree=0.742, adj=0.111, (0 split)
##
## Node number 188: 15 observations
## predicted class=0 expected loss=0.2666667 P(node) =0.02439024
## class counts: 11 4
## probabilities: 0.733 0.267
##
## Node number 189: 33 observations, complexity param=0.01395349
## predicted class=1 expected loss=0.4242424 P(node) =0.05365854
## class counts: 14 19
## probabilities: 0.424 0.576
## left son=378 (12 obs) right son=379 (21 obs)
## Primary splits:
## BloodPressure < 0.3045355 to the right, improve=2.2164500, (0 missing)
## PedigreeFunction < -0.2833329 to the right, improve=1.4948380, (0 missing)
## Age < 1.340034 to the right, improve=1.4948380, (0 missing)
## Glucose < -0.3720224 to the right, improve=1.4429510, (0 missing)
## Pregnancies < 1.08469 to the right, improve=0.8864295, (0 missing)
## Surrogate splits:
## Age < 0.8298403 to the right, agree=0.788, adj=0.417, (0 split)
## BMI < -0.4620122 to the left, agree=0.727, adj=0.250, (0 split)
## SkinThickness < -0.3784087 to the right, agree=0.667, adj=0.083, (0 split)
## PedigreeFunction < -0.6938016 to the left, agree=0.667, adj=0.083, (0 split)
##
## Node number 226: 22 observations, complexity param=0.01085271
## predicted class=0 expected loss=0.4545455 P(node) =0.03577236
## class counts: 12 10
## probabilities: 0.545 0.455
## left son=452 (11 obs) right son=453 (11 obs)
## Primary splits:
## Pregnancies < -0.3991752 to the left, improve=1.4545450, (0 missing)
## BloodPressure < 0.2012074 to the right, improve=1.4545450, (0 missing)
## Insulin < -0.2368842 to the left, improve=1.4545450, (0 missing)
## BMI < 0.0009413653 to the right, improve=1.3852810, (0 missing)
## Glucose < 0.5506421 to the left, improve=0.7305195, (0 missing)
## Surrogate splits:
## Glucose < 0.8164946 to the left, agree=0.727, adj=0.455, (0 split)
## BMI < 0.5653642 to the right, agree=0.727, adj=0.455, (0 split)
## SkinThickness < 1.094741 to the right, agree=0.682, adj=0.364, (0 split)
## BloodPressure < 0.2012074 to the right, agree=0.636, adj=0.273, (0 split)
## Insulin < -0.2368842 to the left, agree=0.636, adj=0.273, (0 split)
##
## Node number 227: 9 observations
## predicted class=1 expected loss=0.2222222 P(node) =0.01463415
## class counts: 2 7
## probabilities: 0.222 0.778
##
## Node number 378: 12 observations
## predicted class=0 expected loss=0.3333333 P(node) =0.0195122
## class counts: 8 4
## probabilities: 0.667 0.333
##
## Node number 379: 21 observations
## predicted class=1 expected loss=0.2857143 P(node) =0.03414634
## class counts: 6 15
## probabilities: 0.286 0.714
##
## Node number 452: 11 observations
## predicted class=0 expected loss=0.2727273 P(node) =0.01788618
## class counts: 8 3
## probabilities: 0.727 0.273
##
## Node number 453: 11 observations
## predicted class=1 expected loss=0.3636364 P(node) =0.01788618
## class counts: 4 7
## probabilities: 0.364 0.636
prp(CART)
predClassCART <- predict(CART, newdata = test.data, type = "class")
confusionMatrix(predClassCART, test.data$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 15
## 1 12 38
##
## Accuracy : 0.8235
## 95% CI : (0.7537, 0.8804)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : 2.528e-06
##
## Kappa : 0.605
##
## Mcnemar's Test P-Value : 0.7003
##
## Sensitivity : 0.8800
## Specificity : 0.7170
## Pos Pred Value : 0.8544
## Neg Pred Value : 0.7600
## Prevalence : 0.6536
## Detection Rate : 0.5752
## Detection Prevalence : 0.6732
## Balanced Accuracy : 0.7985
##
## 'Positive' Class : 0
##
predProbCART <- predict(CART, newdata = test.data, type = "prob")[,2]
CART.ROC <- performance(prediction(predProbCART, test.data$Outcome), measure="tpr", x.measure="fpr")
CART.AUC <- performance(prediction(predProbCART, test.data$Outcome), measure="auc")@y.values
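For intuition, the AUC returned by `performance(..., measure = "auc")` is the trapezoidal area under the (FPR, TPR) curve. A minimal sketch on toy ROC points (not the actual CART curve):

```r
# Toy ROC points: (fpr, tpr) pairs running from (0, 0) to (1, 1)
fpr <- c(0, 0.2, 0.5, 1)
tpr <- c(0, 0.6, 0.9, 1)
# Trapezoidal rule: sum of segment widths times average heights
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc  # 0.76
```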
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
RandomForest <- randomForest(Outcome ~ ., data=train.data, importance = TRUE, ntree = 500)
summary(RandomForest)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 615 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 1230 matrix numeric
## oob.times 615 -none- numeric
## classes 2 -none- character
## importance 36 -none- numeric
## importanceSD 27 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 615 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
plot(RandomForest)
# Get the probability of class "1" (diabetes) from the second column
predProbRF <- predict(RandomForest, newdata = test.data, type = "prob")[,2]
# Get the predicted class
predClassRF <- predict(RandomForest, newdata = test.data, type = "response")
RF.ROC <- performance(prediction(predProbRF, test.data$Outcome),
measure="tpr", x.measure="fpr")
RF.AUC <- performance(prediction(predProbRF, test.data$Outcome),
measure="auc")@y.values
Apply all the classification models you learned in this class to the diabetes dataset to predict Outcome from the eight predictor variables. Which classification method was the best? Use AUC to decide.
Then, apply all the classification models again, this time predicting Outcome from the eight predictors plus the results of one clustering method from the first part. Explain why you picked that clustering method over the others. Which classification method was the best? Use AUC to decide.
Does including the clustering results increase the accuracy rates?
The model summaries show that the cluster identified by k-means does not play a significant role in the Logit or CART models, and CART is also the worst model in terms of AUC. However, the cluster variable is used by individual trees within the RF model, which is our best-performing model. It is possible that there are interactions or correlations among predictors that the Logit and CART models fail to capture but that the RF model exploits.
Interpret the results from each classification method. What are the pros and cons of each?
### Plot ROC curves for all models, with Logit as the base
plot(Logit.ROC, lwd=2, main = "ROC Comparison for Models on Test Dataset")
# Add SVM
lines(attributes(SVM.ROC)$x.values[[1]], attributes(SVM.ROC)$y.values[[1]],
col="red", lwd=2)
# Add KNN
lines(attributes(KNN.ROC)$x.values[[1]], attributes(KNN.ROC)$y.values[[1]],
col="blue", lwd=2)
# Add CART
lines(attributes(CART.ROC)$x.values[[1]], attributes(CART.ROC)$y.values[[1]],
col="green", lwd=2)
# Add RF
lines(attributes(RF.ROC)$x.values[[1]], attributes(RF.ROC)$y.values[[1]],
col="brown", lwd=2)
# Add legend
legend(x=.5, y=.6, c("Logistic (Step)", "SVM", "KNN", "CART", "RF"),
col=c("black", "red", "blue", "green", "brown"), lwd=c(2,2,2,2,2))
Which is the best among all the classifiers?
Logit.AUC
## [[1]]
## [1] 0.8415094
SVM.AUC
## [[1]]
## [1] 0.8426415
KNN.AUC
## [[1]]
## [1] 0.8383962
CART.AUC
## [[1]]
## [1] 0.8117925
RF.AUC
## [[1]]
## [1] 0.8559434
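Collecting the five AUCs into one named vector makes the final comparison explicit (the values below are the ones printed above):

```r
# AUC values reported above for each model on the test set
aucs <- c(Logit = 0.8415094, SVM = 0.8426415, KNN = 0.8383962,
          CART = 0.8117925, RF = 0.8559434)
sort(aucs, decreasing = TRUE)
names(which.max(aucs))  # "RF"
```

By AUC, the random forest is the best classifier, narrowly ahead of the SVM and the stepwise logit; CART is the weakest.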